-
Notifications
You must be signed in to change notification settings - Fork 308
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix test approach for detecting issues #93
base: main
Are you sure you want to change the base?
Fix test approach for detecting issues #93
Conversation
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## main #93 +/- ##
=========================================
Coverage 100.00% 100.00%
=========================================
Files 4 4
Lines 95 97 +2
=========================================
+ Hits 95 97 +2 ☔ View full report in Codecov by Sentry. |
ratelimit_test.go
Outdated
"github.com/stretchr/testify/assert" | ||
) | ||
|
||
// testTime is a fake time used for testing. | ||
type testTime struct { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If this is copied over from golang/time@579cf78, could we:
(1) put this is a separate file
(2) attribute the source clearly
(3) call out differences in the implementation? For example, I see that advanceUnlocked
is different than the reference implementation, but I cannot easily tell why.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
Thanks! This looks complicated, it'll take me some time to grok this (sorry) - I would expect maybe some time over the weekend. It's interesting that benbjohnson/clock#42 was fixed by @abhinav, looks like the clock bit us a few times.
@storozhukBM - would you like to partially revert #91? We could add your new implementation to the repo, but not make it the default (yet). This way we could track the (1) clock change (2) bug fix in the same PR eventually. I can also do this, but I thought you might want the addition to be attributed to you in |
2 Questions: |
(1) I'd do a separate diff. This one could be about fixing the fake time.
(2) Your call really. I would slightly to push the broken version for now,
and then fix it, but happy either way.
tdlr; is that I'd like to play more with the clock implementations/races
(or at least understand your implementation - which I currently don't)
before landing this PR.
…On Mon, Jun 27, 2022 at 5:13 PM Bohdan Storozhuk ***@***.***> wrote:
@storozhukBM <https://github.com/storozhukBM> - would you like to
partially revert #91 <#91>? We
could add your new implementation to the repo, but not make it the default
(yet). This way we could track the (1) clock change (2) bug fix in the same
PR eventually. I can also do this, but I thought you might want the
addition to be attributed to you in git log :)
2 Questions:
Should I do the revert in a separate PR or this one?
Should I fix the but in the int64 based implementation right away, since I
already have a fix?
—
Reply to this email directly, view it on GitHub
<#93 (comment)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/AACG3UWENEQ5O6BZERYC7YDVRI7TJANCNFSM5Z7XTTVA>
.
You are receiving this because you commented.Message ID:
***@***.***>
|
Let's do the following thing:
|
This limiter was introduced and merged in the following PR #85 Later @twelsh-aw found an issue with this implementation #90 So @rabbbit reverted this change in #91 Our tests did not detect this issue, so we have a separate PR #93 that enhances our tests approach to detect potential errors better. With this PR, we want to restore the int64-based atomic rate limiter implementation as a non-default rate limiter and then check that #93 will detect the bug. Right after it, we'll open a subsequent PR to fix this bug.
Done, #94 merged. sorry for the delay. Thanks!
…On Wed, Jun 29, 2022 at 1:29 PM Bohdan Storozhuk ***@***.***> wrote:
@rabbbit <https://github.com/rabbbit>
First step PR #94 <#94>
—
Reply to this email directly, view it on GitHub
<#93 (comment)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/AACG3URGCAFOGW4QFOJ636TVRSWZLANCNFSM5Z7XTTVA>
.
You are receiving this because you were mentioned.Message ID:
***@***.***>
|
This limiter was introduced and merged in the following PR uber-go#85 Later @twelsh-aw found an issue with this implementation uber-go#90 So @rabbbit reverted this change in uber-go#91 Our tests did not detect this issue, so we have a separate PR uber-go#93 that enhances our tests approach to detect potential errors better. With this PR, we want to restore the int64-based atomic rate limiter implementation as a non-default rate limiter and then check that uber-go#93 will detect the bug. Right after it, we'll open a subsequent PR to fix this bug.
9d5dd18
to
1053ca8
Compare
# Conflicts: # ratelimit_test.go
Now we can see that new test approach detects issues with |
In #93 @storozhukBM adds test parallization, which is unrelated to the actual change. I'd rather have this as a separate PR. The test currently take ~12 seconds on my laptop, so I see why we might want to run them in parallel. There are other parallelization opportunities (we could paralelize each of the subtests for each of the rate-limiters), but this is a good start. Before: ``` > go test ./... -race -count 1 ok go.uber.org/ratelimit 12.739s ``` After: ``` > go test ./... -race -count 1 ok go.uber.org/ratelimit 6.178s ```
In #93 @storozhukBM adds test parallization, which is unrelated to the actual change. I'd rather have this as a separate PR. The test currently take ~12 seconds on my laptop, so I see why we might want to run them in parallel. There are other parallelization opportunities (we could paralelize each of the subtests for each of the rate-limiters), but this is a good start. Before: ``` > go test ./... -race -count 1 ok go.uber.org/ratelimit 12.739s ``` After: ``` > go test ./... -race -count 1 ok go.uber.org/ratelimit 6.178s ```
ratelimit_mock_time_test.go
Outdated
https://github.com/golang/time/commit/579cf78fd858857c0d766e0d63eb2b0ccf29f436 | ||
|
||
Modified parts: | ||
- advanceUnlocked method uses in-place filtering of timers |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is the goal of in-place filtering just performance of the fake-clock?
I don't think this feels very important (?) - I'd rather have simpler code that's slightly slower?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For me, this approach looks simpler as well as more performant, but check the original and give me your opinion, if you think we should keep the original, I don't mind.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's just that the whole i
, j
looks like we would need tests for tests (off by one errors).
I think I would rather pull the other version as is, and then push a separate PR to fix the copying with clear changes. If we worried about the performance though, wouldn't we want to sort once on insert, rather than advance
?
But:
- see my other comment below
- I understand if fixing the time is not your primary focus for now :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, I can go back to the prev version.
ratelimit_test.go
Outdated
for { | ||
move() | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, but I think this needs a much larger explanation - perhaps as comments in code?
Am I reading this right that this is an infinite loop, trying to move to infinitely move to the next timer ... by recovering from a panic?
Irrespective of goroutine leaks, using panics for signaling - do you think the maxDuration
is not useful here? It seems we should be able to just pass the duration into advance
, theoretically?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Irrespective of goroutine leaks
I think 1 goroutine leak per test case is not a big deal, but if you want, I can add some exit logic here.
using panics for signaling
Code from Ian just panics when you try to run the next timer, and there is nothing to run. We can potentially change that behavior to not handle panics, but I wanted to limit changes in that code.
Sorry, but I think this needs a much larger explanation - perhaps as comments in code?
I'll write some description here first and when we clarify all questions about it, I'll move it into code comments:
testTime
internally has a queue of all timers currently registered within the clock. In our case, there will be timers for goroutines that try to consume permissions and timers for goroutines that should run test asserts.
When we call testTime.advanceToTimer() it basically shifts time to the next registered timer, runs it, and also runs runtime.Gosched() multiple times to give a chance for other goroutines to run and react to what happened in the current timer. Usually, when we run runtime.Gosched() other goroutines will observe the new state, make their moves, and potentially create new timers inside this clock to wait for some
other state change.
If testTime.advanceToTimer() panics, it means that there are no timers to run, and it can happen if the test is already finished or it didn't manage to set up the full test case, so we just sleep a bit to allow and try again to run timers.
Do you think the maxDuration is not useful here?
No, we can use that if we want to get out of the currently infinite loop, we can check on every iteration if testTime is already passed maxDuration.
It seems we should be able to just pass the duration into advance, theoretically?
Yes, but since we advance time in different goroutines in this approach, it is necessary for us to do it one step at a time and give other goroutines the opportunity to make their steps as well. There also can be situations where we start to advance time in this goroutine before test goroutines spawned all permission consumers and asserting goroutines, so we need to just wait for everything to setup, that is what we do when panic happens.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So perhaps naively, but do we need both
- advanceToTimer
- advance(Unlocked?)
Our test runner knows exactly how much time needs to move (maxDuration
). We can just pass it into advance
. advanceUnlocked
already does all the runtime.Gosched
needed.
And soit seem to me that advance
should theoretically be enough?
If the problem is the initial goroutine setup, should we just add the 16*runtime.Gosched
at the beginning of advance
to let the scheduler kick in?
This would also help with r.clock.advance(time.Nanosecond)
you have below.
I understand this kinda modifies the code we copy-paste, but I'd feel much more comfortable with we decreased the API by 50% here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right now, it is hard for me to make changes because tests do not pass until we merge #95
I agree with the comments and changes that you propose right now, but they don't change the approach in general.
I propose to merge #95, and then I can apply your suggested changes, or we can merge this PR as it is right now, and I'll make these improvements in subsequent PRs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Merged #95. I might still want to re-review it later on, it feels like some simplification is possible.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll update this branch with new version and this should potentially fix all tests, event though I'm not sure how your new test for initial calls will work 🤞
In uber-go#93 @storozhukBM adds test parallization, which is unrelated to the actual change. I'd rather have this as a separate PR. The test currently take ~12 seconds on my laptop, so I see why we might want to run them in parallel. There are other parallelization opportunities (we could paralelize each of the subtests for each of the rate-limiters), but this is a good start. Before: ``` > go test ./... -race -count 1 ok go.uber.org/ratelimit 12.739s ``` After: ``` > go test ./... -race -count 1 ok go.uber.org/ratelimit 6.178s ```
This PR fixes the issue found by @twelsh-aw with int64 based implementation #90 Our tests did not detect this issue, so we have a separate PR #93 that enhances our tests approach to detect potential errors better.
@rabbbit |
It's possible that mutex behaviour was broken, but we didn't detect it
because of the broken clock :|
…On Wed, Jul 13, 2022 at 4:15 PM Bohdan Storozhuk ***@***.***> wrote:
@rabbbit <https://github.com/rabbbit>
After merging with the recent main branch, all tests pass, except
TestInitial for mutex rate limiter for some reason. I'll investigate what
happens with it when I have time in the next few days.
—
Reply to this email directly, view it on GitHub
<#93 (comment)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/AACG3UQFR7DHSNS6T3LYLADVT5EXLANCNFSM5Z7XTTVA>
.
You are receiving this because you were mentioned.Message ID:
***@***.***>
|
No, the use of |
Fetch exact numbers for goroutines in runtime to make that amount of Gosched calls.
@rabbbit |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Almost SGTM:
- I don't think we need the
GetName
(?) - I don't like the amount of
time.Sleep
but perhaps it's unavoidable
I've been wondering about some simplified fake clock library - given the amount of issues we've seen it seems it must be possible to make it simpler -__-
ratelimit_test.go
Outdated
type testRunner interface { | ||
getName() string |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why do we need this?
The name is included in the test-name automatically via tt.run (?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
agree
"github.com/stretchr/testify/assert" | ||
) | ||
|
||
const advanceBackoffDuration = 5 * time.Millisecond |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you know why we need this? Is the gosched
not enough?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can notice that gosched
calls happen under lock. It means that all goroutines that would like to get the current time from this clock or register some additional timer won't be able to do so, even if we run these gosched calls. In our case, it means that most "startTaking" goroutines will never work while we step through timers, but they need the most time resource in our tests run.
So we actually need both types of pauses between runs. One is using gosched
under lock for all goroutines that were waiting for the timer to click and are not blocked on mutex of this clock.
The other one is for all goroutines that are blocked on mutex of this clock and need to make some progress between timer advances.
@@ -102,6 +115,7 @@ func (r *runnerImpl) startTaking(rls ...Limiter) { | |||
for _, rl := range rls { | |||
rl.Take() | |||
} | |||
r.clock.advance(time.Nanosecond, 0) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
and we need this one as well?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, this one imitates time progressions between rl.Take()
calls. With a real clock, when you call rl.Take()
do something and call rl.Take()
again time.Now()
will move forward in time, so we want to imitate that.
If you take a look at the libraries we have used in the past, like https://github.com/benbjohnson/clock/ it is MUCH more complex, has concurrency bugs, and `time. Sleep's as well here I researched every library that I was able to find on Github, and this approach from One more consideration that we should take into account is that we can find a lib or implementation that will pass all our tests but is going to detect the real bug. After we change the time library, we should use mutation testing at least manually introduce some bugs into the existing rate limiter implementations that we have and make sure that our tests with the new clock library detect the issue. |
Hey @rabbbit, what do you think? Should we move forward with this PR? |
@rabbbit can we find someone form Uber to make a depth review? |
We maybe can get something in standard library for this golang/go#67434 |
The problem in the previous approach was that mock time literally didn't run between
limite.Take()
calls, sonow.Sub(oldState.last)
would always be 0 that can be problematic for some issues detection.This PR adds time increments between
limite.Take()
calls in(r *runnerImpl) startTaking
. Unfortunately our library for clock mockinggithub.com/benbjohnson/clock
has some races reported here benbjohnson/clock#44 and here benbjohnson/clock#42.So I decided to use the time mocking approach that Ian Lance Taylor used 19 days ago in https://github.com/golang/time repository here golang/time@579cf78
Note: this tests approach detects an error in int64 based limiter implementation that the previous test didn't manage to catch.